It is enough to call summary() on each column of swiss. This can be done in a functional programming style using package purrr. The collection of summaries can then be rearranged to build a dataframe that is fit for reporting.
We have to pick some graphical summary of the data. Boxplots and violin plots could be used if we are looking for concision.
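As an illustration, here is a minimal base-R sketch of the idea. It is not the original solution: sapply() stands in for purrr::map(), and the names summaries and report are ours.

```r
# Base-R sketch: one row per column of `swiss`, one column per summary statistic.
# sapply() plays the role that purrr::map() would play in a tidyverse version.
data(swiss)                                  # built-in dataset (package datasets)
summaries <- sapply(swiss, summary)          # matrix: 6 statistics x 6 variables
report <- as.data.frame(t(summaries))        # transpose: one row per variable
print(round(report, 1))
```

The transposed layout (variables in rows, statistics in columns) is usually easier to read in a report than the list returned by a plain map.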
We use histograms to get more details about each column.
Note that covariates have different meanings: Agriculture, Catholic, Examination, and Education are percentages with values between \(0\) and \(100\).
We have no details about the standardized fertility index Fertility.
Infant.Mortality is also a rate:
Infant mortality is the death of an infant before his or her first birthday. The infant mortality rate is the number of infant deaths for every 1,000 live births. In addition to giving us key information about maternal and infant health, the infant mortality rate is an important marker of the overall health of a society.
We reuse the function we have already developed during previous sessions.
Code
make_biotifoul(swiss, .f = is.numeric)
Histograms reveal that our covariates have very different distributions.
Religious affiliation (Catholic) tells us that there are two types of districts, which is reminiscent of the old principle Cujus regio, ejus religio (see Old Swiss Confederacy).
Agriculture shows that in most districts, agriculture was still a very important activity.
Education reveals that in all but a few districts, most children did not receive secondary education. Examination shows that some districts lag behind the bulk of districts. Even fewer exhibit a superior performance.
The two demographic variables Fertility and Infant.Mortality look roughly unimodal with a few extreme districts.
Investigate correlations
Compute, display and comment on the sample correlation matrix.
Display jointplots for each pair of variables.
solution
Package corrr, with its functions correlate and rplot, provides a convenient tool.
Note that rplot() creates a graphical object of class ggplot. We can endow it with more layers.
Code
corrr::correlate(swiss) %>%
  corrr::rplot() +
  ggtitle("Correlation plot for Swiss Fertility data")
The high positive linear correlation between Education and Examination is only moderately surprising. The negative correlation between the proportion of people involved in Agriculture and both Education and Examination is not too surprising either: secondary schooling required pupils from rural areas to move to cities.
A more intriguing observation concerns the pairs Catholic and Examination (negative correlation) and Catholic and Education (little correlation).
The response variable Fertility looks negatively correlated with Examination and Education. These correlations are worth exploring further. In demography, the decline of fertility is often associated with the rise of women's education. Note that Examination is about males, and that Education does not give details about the way women complete primary education.
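The pairs discussed above can also be read directly from the numerical correlation matrix. The following base-R sketch uses cor() instead of corrr; the names R, pairs_of_interest and rho are ours.

```r
data(swiss)
R <- cor(swiss)                 # Pearson sample correlation matrix (6 x 6)
print(round(R, 2))

# Pairs of variables discussed in the text
pairs_of_interest <- rbind(
  c("Education",   "Examination"),
  c("Agriculture", "Education"),
  c("Agriculture", "Examination"),
  c("Catholic",    "Examination"),
  c("Catholic",    "Education"),
  c("Fertility",   "Examination"),
  c("Fertility",   "Education")
)
# Indexing with a two-column character matrix picks entries by (row name, column name)
rho <- R[pairs_of_interest]
names(rho) <- paste(pairs_of_interest[, 1], pairs_of_interest[, 2], sep = " ~ ")
print(round(rho, 2))
```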
Perform PCA on covariates
Pairwise analysis did not provide us with a clear and simple picture of the French-speaking districts.
Play with centering and scaling
We first call prcomp() with the default arguments for centering and scaling, that is, we center columns and do not attempt to standardize columns.
solution
Code
pco <- swiss %>%
  select(-Fertility) %>%
  prcomp()
The result
solution
Hand-made centering of the dataframe
Code
X <- select(swiss, -Fertility)
n <- nrow(X)
Y <- (X - matrix(1, nrow = n, ncol = 1) %*% rep(1/n, n) %*% as.matrix(X))
Y <- as.matrix(Y)
Function scale(X, scale=F) from base R does the job.
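To make the equivalence concrete, the sketch below compares the hand-made centering with scale(). It uses base R only; the names ones, Y_hand and Y_scale are ours.

```r
data(swiss)
X <- as.matrix(swiss[, names(swiss) != "Fertility"])   # covariates only
n <- nrow(X)

# Hand-made centering: subtract column means using the averaging matrix (1/n) 1 1^T
ones   <- matrix(1, nrow = n, ncol = 1)
Y_hand <- X - ones %*% t(ones) %*% X / n

# Same operation with base R
Y_scale <- scale(X, center = TRUE, scale = FALSE)

max(abs(Y_hand - Y_scale))     # expected to be at rounding-error level
max(abs(colMeans(Y_hand)))     # column means vanish after centering
```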
solution
Code
svd_Y <- svd(Y)

svd_Y %$%
  (as.matrix(Y) - u %*% diag(d) %*% t(v)) %>%
  norm(type = "F") # checking the factorization
[1] 2.054251e-13
Code
norm(diag(1, ncol(Y)) - (svd_Y %$% (t(v) %*% v)), 'F') # checking that columns of `v` form an orthonormal family
[1] 1.261261e-15
Note that we used the exposing pipe %$% from magrittr to unpack svd_Y which is a list with class svd and members named u, d and v.
We could have used with(,) from base R.
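For comparison, here is a sketch of the same two checks written with with() rather than the exposing pipe. It rebuilds Y with scale() so that it runs on its own; the names err_factorization and err_orthonormal are ours.

```r
data(swiss)
Y <- scale(as.matrix(swiss[, names(swiss) != "Fertility"]),
           center = TRUE, scale = FALSE)
svd_Y <- svd(Y)

# with() evaluates the expression with the members u, d and v of svd_Y in scope
err_factorization <- with(svd_Y, norm(Y - u %*% diag(d) %*% t(v), type = "F"))
err_orthonormal   <- with(svd_Y, norm(diag(1, ncol(Y)) - t(v) %*% v, type = "F"))
c(err_factorization, err_orthonormal)
```

Both values should be of the order of machine precision, as in the outputs shown above.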
solution
The matrix \(1/n Y^T \times Y\) is the covariance matrix of the covariates. The spectral decomposition of the symmetric positive semidefinite (PSD) matrix \(1/n Y^T \times Y\) is related to the SVD factorization of \(Y\).
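The relation can be spelled out in one line. Write the thin SVD as \(Y = U \times D \times V^T\), with \(U^T \times U = V^T \times V = I\) and \(D\) diagonal with non-negative entries \(d_1, d_2, \ldots\) (this notation mirrors the members u, d and v returned by svd()). Then:

```latex
\[
Y^T \times Y
  = V \times D \times U^T \times U \times D \times V^T
  = V \times D^2 \times V^T
\]
```

So the columns of \(V\) are eigenvectors of \(Y^T \times Y\), with eigenvalues \(d_j^2\), and the eigenvalues of \(1/n Y^T \times Y\) are \(d_j^2 / n\).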
The spectral decomposition of \(Y^T \times Y\) can be obtained using eigen.
Code
(t(eigen(t(Y) %*% Y)$vectors) %*% svd_Y$v) %>%
  round(digits = 2)
Here, the eigenvectors of \(Y^T \times Y\) coincide with the right singular vectors of \(Y\) corresponding to non-zero singular values. Up to sign changes, it is always true when the non-zero singular values are pairwise distinct.
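A self-contained base-R sketch of this comparison is given below. It also checks the eigenvalues against the squared singular values; the name eig is ours.

```r
data(swiss)
Y <- scale(as.matrix(swiss[, names(swiss) != "Fertility"]),
           center = TRUE, scale = FALSE)
svd_Y <- svd(Y)
eig   <- eigen(crossprod(Y), symmetric = TRUE)   # crossprod(Y) is t(Y) %*% Y

# Eigenvalues of t(Y) %*% Y are the squared singular values of Y
all.equal(eig$values, svd_Y$d^2)

# Each eigenvector matches one right singular vector up to sign:
# the absolute cross-products form (numerically) the identity matrix
round(abs(t(eig$vectors) %*% svd_Y$v), digits = 2)
```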
Now we check that prcomp is indeed a wrapper for svd.
Mind the braces on the right side of the first pipe
The quantity 1 - percent tells the reader about the relative Frobenius error achieved by keeping the first components of the SVD expansion.
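One way to carry out the check that prcomp relies on the SVD is sketched below in base R. This is not the original solution code, and the names X, n and sv are ours.

```r
data(swiss)
X   <- as.matrix(swiss[, names(swiss) != "Fertility"])
n   <- nrow(X)
pco <- prcomp(X)                                    # centered, unscaled (defaults)
sv  <- svd(scale(X, center = TRUE, scale = FALSE))  # SVD of the centered matrix

# Standard deviations of the principal components: d / sqrt(n - 1)
all.equal(pco$sdev, sv$d / sqrt(n - 1))

# Loadings (rotation) are the right singular vectors, compared here up to sign
max(abs(abs(unname(pco$rotation)) - abs(sv$v)))
```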
Code
pco %>%
  p_screeplot() +
  labs(title = "Screeplot for swiss fertility data",
       caption = "Keeping the first two components is enough to achieve relative Frobenius error 3.3%")
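The truncation error mentioned in the caption can be computed by hand. The sketch below (base R; the names s, k, Y_k and rel_err_* are ours) computes the relative Frobenius error of the rank-2 truncation in two ways, directly and from the discarded singular values. Its square equals one minus the cumulative proportion of variance carried by the first k components.

```r
data(swiss)
Y <- scale(as.matrix(swiss[, names(swiss) != "Fertility"]),
           center = TRUE, scale = FALSE)
s <- svd(Y)
k <- 2                                   # number of components kept

# Rank-k truncation of the SVD expansion
Y_k <- s$u[, 1:k] %*% diag(s$d[1:k]) %*% t(s$v[, 1:k])

# Relative Frobenius error, computed directly and from the singular values
rel_err_direct <- norm(Y - Y_k, type = "F") / norm(Y, type = "F")
rel_err_sv     <- sqrt(sum(s$d[-(1:k)]^2) / sum(s$d^2))
c(direct = rel_err_direct, from_singular_values = rel_err_sv,
  squared = rel_err_sv^2)
```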
Project the dataset on the first two principal components (perform dimension reduction) and build a scatterplot. Colour the points according to the value of original covariates.
solution
Code
p <- pco %>%
  augment(swiss) %>%
  ggplot() +
  aes(x = .fittedPC1, y = .fittedPC2, label = .rownames) +
  geom_point() +
  coord_fixed() +
  ggrepel::geom_text_repel()

(p + aes(color = Infant.Mortality)) +
  (p + aes(color = Education)) +
  (p + aes(color = Examination)) +
  (p + aes(color = Catholic)) +
  (p + aes(color = Agriculture)) +
  (p + aes(color = Fertility)) +
  plot_layout(ncol = 2) +
  plot_annotation(title = "Swiss data on first two PCs",
                  subtitle = "centered, unscaled")
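The coordinates plotted above are just the centered data projected on the first two principal axes. A short base-R sketch of that projection, without broom or ggplot2 (the name scores2 is ours):

```r
data(swiss)
X   <- as.matrix(swiss[, names(swiss) != "Fertility"])
pco <- prcomp(X)
Y   <- scale(X, center = TRUE, scale = FALSE)

# Coordinates of each district in the plane spanned by the first two principal axes
scores2 <- Y %*% pco$rotation[, 1:2]
max(abs(scores2 - pco$x[, 1:2]))       # same as the scores stored by prcomp()
head(round(scores2, 1))
```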
solution
We can extract factor \(V\) from the SVD factorization using generic function tidy from package broom.
\(X\) : data matrix after column centering (use scale(., center=T, scale=F))
\[X = U \times D \times V^T\]
solution
Code
X <- as.matrix(select(swiss, -Fertility)) |>
  scale(center = T, scale = F)

# check centering, spot the difference in variances
X |>
  as_tibble() |>
  summarise(across(everything(), c(var, mean)))

# check the left singular vectors
pco$x %*% diag((pco$sdev)^(-1)) |>
  as_tibble() |>
  summarise(across(everything(), c(mean, var)))
Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if
`.name_repair` is omitted as of tibble 2.0.0.
ℹ Using compatibility `.name_repair`.
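The warning comes from the fact that the product with diag() returns a matrix without column names. A base-R sketch of the same check, where column names are restored by hand (the name Z is ours), could look as follows. Naming the columns before calling as_tibble() would presumably silence the warning as well.

```r
data(swiss)
pco <- prcomp(as.matrix(swiss[, names(swiss) != "Fertility"]))

Z <- pco$x %*% diag(1 / pco$sdev)      # the product drops the column names
colnames(Z) <- colnames(pco$x)         # restore PC1, ..., PC5

# Rescaled scores have mean 0 and variance 1 in every column
round(rbind(mean = colMeans(Z), var = apply(Z, 2, var)), 10)
```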
Plot again the correlation circle using the same principal axes as before, but add the Fertility variable. How does Fertility relate to the covariates? To the principal axes?
solution
Code
U <- pco_cs %$% # exposition pipe
  (1/sqrt(nrow(x)-1) * x %*% diag((sdev)^(-1)))
Uprime <- with(pco_cs, 1/sqrt(nrow(x)-1) * x %*% diag((sdev)^(-1)))
t(U) %*% U
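Why t(U) %*% U should come out as the identity matrix can be seen from a short computation. We assume here that pco_cs is the output of prcomp() on the centered and scaled covariates (it is not defined in this section), that \(\widetilde{X} = U \times D \times V^T\) is the SVD of that centered and scaled matrix with \(n\) rows, and that all singular values are positive. Then x holds the scores \(\widetilde{X} \times V = U \times D\) and sdev holds \(d_j / \sqrt{n-1}\), so that:

```latex
\[
\frac{1}{\sqrt{n-1}} \, (U \times D) \times \left( \frac{1}{\sqrt{n-1}} D \right)^{-1}
  = U \times D \times D^{-1}
  = U ,
\qquad
U^T \times U = I
\]
```

The rescaled scores are therefore exactly the left singular vectors, which form an orthonormal family.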